FOREWORD


The enclosed paper is scheduled to appear in the volume Vistas in
Information Handling and is being distributed in the present form in the
interests of prompt reporting of work in progress.

Although some of the material covered was previously discussed in
Sections III and IV of our first report,* this paper develops the linear
method of association more simply and concisely. We have, for example,
revised the mathematical development so that the linear transformations
employed in the linear network approach to relevance-ranking are
exhibited as matrices. This new formulation enables us to show that the
relevance-ranking transformation can be represented, in a fairly natural
manner, as the product of three (intuitively meaningful) mappings:
an index term association mapping, a mapping from index term values to
document values, and a document association mapping. In addition to
mathematical simplifications, new material is also introduced on a
hypothesis relating to the breakdown of "association" into a "synonymy"
and a "contiguity" portion. These types of relationships are tentatively
explored in terms of the matrix formulation. A portion of Section III
of Report CACL-1 is included almost intact as an appendix to the paper;
it is repeated in this report for completeness.

It  is  felt  that  the  more  compact  presentation  given  in  this  paper  will 
simplify  and  augment  our  earlier  discussion  of  the  linear  network  model, 
and  the  paper  is  distributed  for  this  purpose.  Since  it  treats  only 
one  of  several  topics  under  investigation,  it  should  not  be  construed 
as  a  complete  report  of  our  activities  under  the  contract. 

ABSTRACT


This paper is concerned with the recognition and exploitation of term
associations for the retrieval of documents. A general theory of
association and associative retrieval is presented; it is based on the use
of linear transformations, both for establishing associations among
terms and for discriminating among documents. The design and behavior
of a simple experimental device which realizes the theory is discussed.


LINEAR  ASSOCIATIVE  INFORMATION  RETRIEVAL* 


"The  real  heart  of  the  matter  of  selection,  however,  goes 
deeper  than  a  lag  in  the  adoption  of  mechanisms  by  libraries, 
or  a  lack  of  development  of  devices  for  their  use.  Our 
ineptitude  in  getting  at  the  record  is  largely  caused  by 
the  artificiality  of  systems  of  indexing.  When  data  of 
any sort are placed in storage, they are filed alphabetically
or numerically, and information is found (when it
is) by tracing it down from subclass to subclass. It can
be in only one place, unless duplicates are used; one has
to have rules as to which path will locate it, and the rules
are cumbersome. Having found one item, moreover, one has
to emerge from the system and re-enter on a new path.

"The human mind does not work that way. It operates by
association. With one item in its grasp, it snaps instantly
to the next that is suggested by the association of
thoughts, in accordance with some intricate web of trails
carried by the cells of the brain .... Man cannot hope to
duplicate this mental process artificially, but he certainly
ought to be able to learn from it."

"As We May Think," by Vannevar Bush, from
Atlantic Monthly, July 1945.


I.  INTRODUCTION 

This paper is concerned with the recognition and exploitation of
term associations for the retrieval of documents. A general theory of
association and associative retrieval is presented; it is based on the
use of linear transformations, both for establishing associations
between terms and for discriminating between documents. A simple
experimental device which realizes the theory has been built, and examples
of its operation are discussed in the Appendix.

A document retrieval system may be generally characterized in the
following manner: a collection of d documents and a set of t index
terms are presumed to be given, and it is assumed that each document
has been indexed by the assignment to it of one or several applicable
index terms.* Based upon such an indexing of the document collection,
a retrieval system ideally functions so as to identify exactly those
documents which are relevant to an inquiry consisting of a specification
of one or more pertinent index terms.

To proceed more formally, an inquiry may be regarded as consisting of an
assignment of positive "importance" values, not all necessarily equal,
to some of those index terms which most directly characterize the
matter of interest to the inquirer, and an assignment of value zero
to all other index terms. It is convenient to consider the inquiry
to be represented by a t-dimensional column vector Q, in which
each component $q_{i}$ exhibits the value assigned to the ith index term
by the inquirer.

Likewise, the response of a retrieval system to an inquiry will be
regarded to be an assignment of nonnegative values for retrieval to all
documents in the collection, where the values reflect relevance to the
inquiry. This assignment defines a d-dimensional response vector R,
in which $r_{j}$ exhibits the value assigned to a document j by the
system in response to a given inquiry Q.

The retrieval process can therefore be viewed as a mathematical
transformation from an inquiry vector Q to a response vector R, and there
is no loss in generality in doing so.


* In practice, the index terms may be either descriptors assigned by
manual indexing procedures, or keywords selected from the text by
automatic means. This paper is not concerned with the relative
merits of these two different philosophies of index term assignment,
but rather is intended to suggest a method of retrieval which
will yield improved results with either.


The relative magnitudes of the values define an ordering on the
documents. In an effective retrieval system, the ordering of document
values should reflect the actual ordering of relevance of the documents,
and the documents should be retrieved in decreasing order of this
value.*


Associative  Retrieval  Methods 

An "associative" retrieval system is one which attempts to take possible
interconnections among index terms into account in performing the
retrieval transformation. One objective of introducing association is
to give, as a response, an ordering of documents ranked on a continuous
scale of relevance rather than artificially grouped into two classes.
Another objective is to free the requestor from a necessity to couch his
inquiry in precisely the same terms employed by the indexer.

The role of term association within an associative retrieval scheme is
roughly as follows. Each index term is regarded to bear some stronger
or weaker measure of association with each other index term. A document
is evaluated with respect to a given inquiry not only by considering
the presence of index terms intersecting the document and inquiry,
but also by considering the presence of other terms in the document which
may in turn be strongly associated with terms in the inquiry. Thus, for
example, given an inquiry about the "production" of "automobiles," a
system may assign a high value to a document about the "manufacture" of
"motor vehicles," provided that the corresponding terms are known to be
highly associated.

* In most existing operational retrieval systems, of course, only
two levels of value are recognized, "relevant" and "irrelevant."
Such a boundary line is usually an artificial one, however, and
hence is the source of many errors in retrieval; either "relevant"
documents are not in fact relevant, or "irrelevant" documents are
in fact relevant, a phenomenon well known to users of document
retrieval systems.


A retrieval system embodying an automatic thesaurus thus qualifies as being
"associative" in the strict sense stated above. In this paper, however,
we will be primarily concerned with the case in which the associations
are based on formal statistical relationships present within a given
document collection. A brief speculative discussion of possible linguistic
interpretations which might be assigned to such formal associations will
be given in Section IV.


Methods for performing retrieval transformations with the use of term
associations have been discussed in the literature, but, to our
knowledge, the use of any but the most trivial linear transformations has
never been proposed previously, a surprising fact since linear
transformations are so well understood and since they have been applied so
widely elsewhere. We shall explore the implications of assuming that the
retrieval transformation is linear. Suppose that the linear
transformation is represented by a d x t matrix $\Gamma$ so that

$$R = \Gamma\,Q \tag{1}$$


It will be shown that $\Gamma$ can be viewed as the product of three separate
linear transformations, each of which has a meaning for retrieval. These
will be represented by matrices $\Phi$, $\Theta$, and $\Omega$ so that

$$R = \Phi\,\Theta\,\Omega\,Q \tag{2}$$


In this formula, $\Omega$ represents a t x t index term association matrix,
so that the vector $\Omega Q$ represents a column vector of values of index
terms after the effects of term association are taken into account; the
value assigned to a term in $\Omega Q$ reflects how closely related that term
is to the terms specified in the inquiry. Thus, for example, although in
Q "production" may have value one hundred and "manufacture" value zero,
in $\Omega Q$ "manufacture" will also have positive value, say eighty. The
matrix $\Theta$ is a d x t matrix which attributes values to documents based
on the values of the terms. It is called a discriminant matrix, and
the result $\Theta\,\Omega\,Q$ is in the form of a document response vector.

A final mapping, the d x d matrix $\Phi$, takes into account
interactions (if any) among documents. Such interactions may arise, for
example, when the paragraphs of an article or the chapters of a book
are treated as distinct documents. Since the matrix $\Phi$ performs a
transformation which takes into account known associations among
documents, it is called a document association matrix.

The overall retrieval transformation is thus viewed as a product of
three linear transformations: an index term association transformation,
a transformation from index term values to document values, and
a document association transformation.


II. DETERMINING THE ASSOCIATION MAPPINGS

The  linear  associative  transformations  may  be  developed  from  at 
least  three  equivalent  points  of  view: 

1. Reasoning along probabilistic lines, in which term-term association
is regarded to be a Markov process.

2. Reasoning based upon an electrical network analog.

3. Reasoning based upon the imposition of certain mathematical constraints
on association and identification transformations, primarily
consisting of certain assumptions of linearity and normalizability of
transformation matrices.

All  three  approaches  are  ultimately  equivalent  in  that  they  lead  to 
the  same  set  of  mathematical  formulas.  The  approaches  differ  in  the 
interpretations  they  provide;  each  gives  a  different  avenue  of  appeal 
to  intuition.  The  third  approach  will  be  pursued  in  this  Section,  and  the 
relationship  to  the  electrical  network  analog  approach  will  be  outlined 
in  Section  III. 

A. Case of the Index Term-Document Network

It will be assumed that every document i is connected to each
index term j contained within that document by a bond of strength
$C_{ij} \geq 0$, and that this strength can be determined by simple formal
properties of the document and index term. This number might, for
example, be given by the frequency of occurrence of the index term in the
document, or it might possibly be determined with the aid of syntactic
information. The d x t matrix $C = (C_{ij})$ will be called the
document-index term connection matrix for the corpus.

1.  Conventional  Term  Retrieval 

Almost  all  conventional  (coordinate)  retrieval  systems  rank 
documents  for  retrieval  according  to  the  value  R  obtained  from  the 
simple  linear  mapping: 


$$R = C\,Q \tag{3}$$


In  the  usual  case  the  values  in  both  C  and  Q  are  restricted  to  1 
or  0,  and  it  is  evident  that  those  documents  having  all  the  k  index 
terms  specified  with  value  1  in  Q  are  ranked  first,  followed  by  those 
documents  containing  subsets  of  k  -  1  index  terms  specified  with  value 
1  in  Q  ,  etc. 
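
The conventional mapping (3) can be illustrated with a minimal sketch. The three-document, four-term connection matrix below is a hypothetical toy corpus, not data from this report, and the function name is chosen only for illustration.

```python
# Hypothetical toy corpus: 3 documents (rows) by 4 index terms (columns).
# C[i][j] = 1 if document i is indexed by term j, else 0.
C = [
    [1, 1, 0, 0],  # document 0 carries terms 0 and 1
    [1, 0, 1, 0],  # document 1 carries terms 0 and 2
    [0, 0, 0, 1],  # document 2 carries term 3 only
]


def coordinate_retrieval(C, Q):
    """Return R = C Q as in (3): one retrieval value per document."""
    return [sum(c * q for c, q in zip(row, Q)) for row in C]


Q = [1, 1, 0, 0]  # inquiry naming terms 0 and 1 with value 1
R = coordinate_retrieval(C, Q)

# Documents listed in decreasing order of retrieval value.
ranking = sorted(range(len(R)), key=lambda i: -R[i])
```

With 0/1 entries, each document's value is simply the number of inquiry terms it contains, so the document holding both inquiry terms is ranked first.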


It  is  clear,  however,  that  the  mapping  (3)  attributes  nonzero  values 
for  retrieval  only  to  documents  having  at  least  one  term  in  common  with 
the  inquiry  Q  ,  and  there  is  no  provision  for  retrieval  if  synonymous  or 
otherwise  associated  terms  are  used  in  the  document  and  inquiry.  Thus 
documents on the "production" of "automobiles" might be missed if an
inquiry is phrased in terms of the "manufacture" of "motor vehicles."

2. Associative Retrieval

(a) Expansion in Powers of Index Term Connections

A formal approach to the association problem is to revise
the mapping (3) to take into account index term interconnections as
defined by the corpus itself. This may be accomplished by using powers
of the term-term connection matrix $K = C^{T} C$. A typical element
$k_{rs} = \sum_{i=1}^{d} C_{ir}\,C_{is}$ of this matrix gives a measure of
interconnection between terms r and s via all documents that contain
both of them. An element $k^{(2)}_{rs}$ of $K^{2}$ gives a measure of
interconnection between terms r and s via all pairs of documents such
that one contains r, the other s, and both share one or more other index
terms. Similarly, higher powers of K give measures of interconnection
via longer and longer paths of documents and terms. Obviously, by taking
a weighted sum of powers of K it is possible to obtain an association
matrix which reflects the total effect of all paths of every length;
indeed this is what will be done. For such a weighted sum of powers of K
to be meaningful for retrieval purposes, however, it is first necessary
to select the weights so that the strengths of association for shorter
paths count more than those for longer paths. In fact, association
strengths for longer and longer paths should approach zero, that is,

$$\lim_{n \to \infty} K^{n} = 0 \tag{4}$$

if K is a properly normalized (weighted) term-term connection matrix.
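
As a rough illustration of how powers of K pick up indirect paths, the sketch below forms the unnormalized $K = C^{T} C$ and its square for a hypothetical toy corpus (the matrix and helper names are assumptions made for this example only). Terms 1 and 2 never share a document, yet they become connected at the second power through term 0.

```python
def transpose(M):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


# Hypothetical toy corpus: 3 documents by 4 index terms.
C = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]

K = matmul(transpose(C), C)  # k_rs counts documents containing both r and s
K2 = matmul(K, K)            # paths that pass through one intermediate term

direct_link = K[1][2]     # terms 1 and 2 share no document
indirect_link = K2[1][2]  # but both co-occur with term 0
```

Term 3, which occurs only in a document sharing no terms with the others, stays unconnected to terms 0 through 2 at every power.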


(b) Selection of a Normalization

Since all elements of K are non-negative, a sufficient
condition for convergence (4) is that all of the rows of K sum to less
than unity. This will hold, in turn, if

$$\tilde{K} = \widetilde{C^{T}}\,\tilde{C} \tag{5}$$

$$K = \Lambda\,\tilde{K} = \Lambda\,\widetilde{C^{T}}\,\tilde{C} \tag{6}$$

where the row sums of $\tilde{C}$ and $\widetilde{C^{T}}$ are normalized to
unity, and where $\Lambda$ is a t by t diagonal matrix with all
$\lambda_{ii} < 1$. Specifically, we take

$$\tilde{C} = \sigma\,C \quad \text{and} \quad \widetilde{C^{T}} = \tau\,C^{T} \tag{7}$$

where $\sigma$ and $\tau$ are diagonal matrices given by

$$\sigma_{ii} = \frac{1}{\sum_{j} C_{ij}} \quad \text{and} \quad \tau_{ii} = \frac{1}{\sum_{j} C_{ji}} \tag{8}$$

(note that $\sum_{j} \tilde{C}_{ij} = 1$ and $\sum_{j} (\widetilde{C^{T}})_{ij} = 1$).

A normalization constant $\lambda_{ii} < 1$ is presumed to be assigned to each
index term so that, very roughly speaking, $1 - \lambda_{ii}$ determines the
cost of associating from one document to another through index term i.
The intuitive meaning of this normalization will become clearer to the
reader as he proceeds through the discussion of the remainder of this
Section, Section III, and the Appendix.
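
The effect of the normalization can be checked numerically. The sketch below is only an illustration under stated assumptions: the toy corpus is hypothetical, a single constant `lam` is used for every index term (the text allows a separate constant per term), and all helper names are invented for the example. It row-normalizes C and its transpose, scales by `lam`, and confirms that every row sum equals `lam` and that a high power of the result is close to zero, as (4) requires.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def row_normalize(M):
    """Scale each row to sum to one (rows that sum to zero are left as zeros)."""
    out = []
    for row in M:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


# Hypothetical toy corpus: 3 documents by 4 index terms.
C = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
lam = 0.5  # assumed: one normalization constant shared by all terms

C_tilde = row_normalize(C)              # document rows sum to one
Ct_tilde = row_normalize(transpose(C))  # term rows sum to one
K_tilde = matmul(Ct_tilde, C_tilde)
K_norm = [[lam * x for x in row] for row in K_tilde]

row_sums = [sum(row) for row in K_norm]

# Raise the normalized matrix to the 20th power to observe the decay in (4).
P = K_norm
for _ in range(19):
    P = matmul(P, K_norm)
largest_entry = max(max(row) for row in P)
```

Because each row of the normalized matrix sums to `lam`, each row of its nth power sums to `lam` to the nth power, which is why the entries shrink geometrically.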


(c) Assumptions of Linearity

Now we are prepared to obtain the actual form of the linear
associative retrieval mapping, which follows from two explicit
assumptions of linearity:

a. The value of a document is a linear function of
the values of the index terms contained in it,
where the coefficients of the function are given
by $\tilde{C}$. That is, if W is the vector of index
term values,

$$R = \tilde{C}\,W \tag{9}$$

b. The value of an index term is given by taking
the sum of its original value assigned by the
inquiry Q and a linear function of the
values of the documents containing it,
specifically:

$$W = \Lambda\,\widetilde{C^{T}}\,R + Q \tag{10}$$


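The desired retrieval mapping follows directly from (9) and (10):

$$R = \tilde{C}\,[I - \Lambda\tilde{K}]^{-1}\,Q = \tilde{C}\,[I - \Lambda\,\widetilde{C^{T}}\,\tilde{C}]^{-1}\,Q \tag{11}$$

which can, given (4), be written as a convergent series:

$$R = \tilde{C}\,[I + (\Lambda\tilde{K}) + (\Lambda\tilde{K})^{2} + (\Lambda\tilde{K})^{3} + \cdots]\,Q \tag{12}$$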


Equation (12) is the desired mapping which takes into account the weighted
and summed effect of association paths of all lengths. It is obviously
a generalization of the conventional retrieval mapping (3). It may be
noted first of all that this is of the form (2), where $\tilde{C} = \Theta$
is the discriminant matrix, $[I - \Lambda\tilde{K}]^{-1} = \Omega$ is the
index term association matrix,* and, since there are no direct document
interconnections, $\Phi = I$, the identity mapping. It may be noted next
that rapidity of convergence is determined by $\Lambda$, and that for
$\Lambda = 0$ the conventional term retrieval mapping (3) is obtained.
By regulating the values of $\lambda_{ii}$, association can be either "free"
(for $\lambda$'s near 1) or "narrow" (for $\lambda$'s near 0). The
effectiveness of retrieval using this method will be discussed in Section V.

* Matrices of this form have been applied to the study of indirect
interactions among sectors of the economy by Leontief and others,
and are often referred to in the literature as "Leontief Matrices".
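
A minimal numerical sketch of the associative mapping follows. It alternates the two linearity assumptions (9) and (10) until the values settle, which accumulates the same terms as the series (12). The toy corpus, the single shared constant `lam`, the fixed number of sweeps, and all function names are assumptions made for illustration; they are not taken from the experimental device described in the Appendix.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def row_normalize(M):
    out = []
    for row in M:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def associative_retrieval(C, Q, lam=0.5, sweeps=200):
    """Iterate R = C~ W (9) and W = lam * (C^T)~ R + Q (10) to a fixed point."""
    d, t = len(C), len(C[0])
    C_tilde = row_normalize(C)
    Ct_tilde = row_normalize(transpose(C))
    W = list(Q)
    for _ in range(sweeps):
        R = [sum(C_tilde[i][j] * W[j] for j in range(t)) for i in range(d)]
        W = [Q[j] + lam * sum(Ct_tilde[j][i] * R[i] for i in range(d))
             for j in range(t)]
    return [sum(C_tilde[i][j] * W[j] for j in range(t)) for i in range(d)]


# Hypothetical toy corpus: 3 documents by 4 index terms.
C = [
    [1, 1, 0, 0],  # document 0: terms 0 and 1
    [1, 0, 1, 0],  # document 1: terms 0 and 2
    [0, 0, 0, 1],  # document 2: term 3 only
]
Q = [0.0, 1.0, 0.0, 0.0]  # inquiry on term 1 alone

R_assoc = associative_retrieval(C, Q, lam=0.5)
R_plain = associative_retrieval(C, Q, lam=0.0)  # lam = 0 removes association
```

For this corpus document 1 does not contain the inquiry term and receives zero when `lam` is 0, but it receives a small positive value under association because it shares term 0 with document 0. Document 2, which is unconnected to the inquiry by any path, remains at zero.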


B.  Case  with  Interdocument  Linkages  and  Inter-Index  Term  Linkages 

The  discussion  in  the  previous  Subsection  presumed  the  absence  of 
direct  linkages  between  index  terms  and  other  index  terms,  as  well  as 
the  absence  of  direct  linkages  between  documents  and  other  documents. 
Introduction of such direct linkages may prove useful, however, for
two principal reasons: First of all, if documents are themselves
interrelated, such as by being chapters, sections, or even paragraphs within
a given book, it may be desirable to reflect these relationships with
inter-document links. Information present in citations might also be
used to generate such inter-document links. In the extreme case, when
individual documents are sentences in a stream of writing or speech,
strong intersentence linkages can provide in part for the effect of
antecedence.

Secondly, it could conceivably be desirable to establish a priori linkages
between synonyms. Such linkages would provide a direct means of
introducing "semantic" information not present in the corpus itself.*

* Although the presentation is extended in this Subsection to
accommodate the case when a priori links between index terms are presumed
given, there is reason to believe that such "thesaurus" entries will
in fact be unneeded; such has been the general experience of other
workers on associative indexing schemes [2, 7]. The writers conjecture,
in fact, that the linear transformations to be developed in this
Subsection might be used to generate a "thesaurus" valid for a given
corpus directly from the C matrix for that corpus.

The remainder of this Section is therefore concerned with generalizing the
mathematical treatment of Part A to include the situation when direct
document-document links and direct term-term links are permitted. In
this more general case the connection matrix is square, of dimension d + t,
partitioned as follows:


$$G = \begin{pmatrix} A_{dd} & C_{dt} \\ B_{td} & D_{tt} \end{pmatrix} \tag{13}$$


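In this representation, C is the document-term connection matrix employed
before, B is its transpose, A the document-document connection matrix,
and D the term-term connection matrix. The normalization is obtained by
multiplying on the left by a diagonal matrix, analogous to (8), giving:

$$\tilde{G} = \begin{pmatrix} \tilde{A} & \tilde{C} \\ \tilde{B} & \tilde{D} \end{pmatrix} \tag{14}$$

where the diagonal elements of the normalizing matrix are

$$\frac{1}{\sum_{j=1}^{d} A_{ij} + \sum_{j=d+1}^{t+d} C_{ij}} \quad \text{for document rows,} \quad \text{and} \quad \frac{\lambda_{ii}}{\sum_{j=1}^{d} B_{ij} + \sum_{j=d+1}^{t+d} D_{ij}} \quad \text{for index term rows.}$$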

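Again, in this more general case we still require the solution to satisfy
linear constraints analogous to (9) and (10):

$$\begin{pmatrix} R \\ W \end{pmatrix} = \begin{pmatrix} \tilde{A} & \tilde{C} \\ \tilde{B} & \tilde{D} \end{pmatrix} \begin{pmatrix} R \\ W \end{pmatrix} + \begin{pmatrix} 0 \\ Q \end{pmatrix} \tag{15}$$

and therefore:

$$\begin{pmatrix} R \\ W \end{pmatrix} = \begin{pmatrix} I - \tilde{A} & -\tilde{C} \\ -\tilde{B} & I - \tilde{D} \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ Q \end{pmatrix} \tag{16}$$

This can be solved to give the generalized association formulas:

$$R = (I - \tilde{A})^{-1}\,\tilde{C}\,(I - \tilde{D})^{-1}\,[I - \tilde{B}\,(I - \tilde{A})^{-1}\,\tilde{C}\,(I - \tilde{D})^{-1}]^{-1}\,Q \tag{17a}$$

and

$$W = (I - \tilde{D})^{-1}\,[I - \tilde{B}\,(I - \tilde{A})^{-1}\,\tilde{C}\,(I - \tilde{D})^{-1}]^{-1}\,Q \tag{17b}$$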


Equation (17) may be recognized to be in the form of equation (2). The
factor

$$\Omega = (I - \tilde{D})^{-1}\,[I - \tilde{B}\,(I - \tilde{A})^{-1}\,\tilde{C}\,(I - \tilde{D})^{-1}]^{-1}$$

gives the generalized term association mapping, taking into account
term-term connections, term-document connections, and document-document
connections. The factor $\tilde{C} = \Theta$ gives the discriminant mapping
from term values to document values. Finally, the $(I - \tilde{A})^{-1}$
factor gives a document association mapping, which is performed on the
document values. Obviously, (17) is a generalization of (11), which is a
generalization of the conventional retrieval mapping (3).
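
The step from the linear constraints (15) to the formulas (17) can be written out by block elimination. The following is a sketch of that algebra, assuming that the inverses involved exist; tildes denote the normalized blocks of the connection matrix.

```latex
% Block rows of (15):
%   R = \tilde{A} R + \tilde{C} W,
%   W = \tilde{B} R + \tilde{D} W + Q.
\begin{align*}
R &= (I - \tilde{A})^{-1}\,\tilde{C}\,W
   && \text{(first block row solved for } R\text{)} \\
(I - \tilde{D})\,W - \tilde{B}\,(I - \tilde{A})^{-1}\,\tilde{C}\,W &= Q
   && \text{(substituted into the second block row)} \\
\bigl[I - \tilde{B}\,(I - \tilde{A})^{-1}\,\tilde{C}\,(I - \tilde{D})^{-1}\bigr]\,(I - \tilde{D})\,W &= Q
   && \text{(factoring out } (I - \tilde{D}) \text{ on the right)} \\
W &= (I - \tilde{D})^{-1}\,\bigl[I - \tilde{B}\,(I - \tilde{A})^{-1}\,\tilde{C}\,(I - \tilde{D})^{-1}\bigr]^{-1} Q
   && \text{(this is (17b))} \\
R &= (I - \tilde{A})^{-1}\,\tilde{C}\,W
   && \text{(which, with (17b), gives (17a))}
\end{align*}
```

Setting $\tilde{A} = 0$ and $\tilde{D} = 0$ in the last two lines recovers the simpler mapping (11), with $\tilde{B}$ playing the role of $\Lambda\,\widetilde{C^{T}}$.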


In some applications it may be desirable to decouple the effects of the
document association mapping and the discriminant mapping from those of
the term association mapping. This can be accomplished by choosing
the $c_{ij}$ and $b_{ij}$ values to be small compared to the $a_{ij}$ and
$d_{ij}$ values. In this case (17) essentially reduces to

$$R = (I - \tilde{A})^{-1}\,\tilde{C}\,(I - \tilde{D})^{-1}\,Q \tag{18}$$

indicating that for practical purposes $\Phi$, $\Theta$, and $\Omega$ can be
chosen independently.


Automatic  Thesaurus  Generation 

The transformation (17b) can be used to generate a thesaurus-like
listing valid for the given document collection. Suppose that no a priori
synonym linkages are given (i.e., D = 0), and that $Q_{z}$ is a unit
vector assigning value only to index term z. Then (17b) becomes

$$W_{z} = [I - \tilde{B}\,(I - \tilde{A})^{-1}\,\tilde{C}]^{-1}\,Q_{z}$$

where the values in $W_{z}$ rank every index term according to its degree
of association with index term z. By listing the few topmost-ranked
terms in $W_{z}$ for each term z in turn, a "thesaurus" listing can be
obtained completely automatically. The validity or usefulness of such a
listing must of course be established by experiment.
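
The listing procedure can be sketched for the simplest case, in which there are neither document-document links nor a priori term-term links (A = 0 and D = 0). This is an illustration under stated assumptions only: the toy corpus is hypothetical, one constant `lam` is shared by all terms, the term values are obtained by simple repeated substitution rather than by matrix inversion, and the function names are invented for the example.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def row_normalize(M):
    out = []
    for row in M:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def term_associations(C, z, lam=0.5, sweeps=200):
    """Approximate W_z = [I - B~ C~]^(-1) Q_z with B~ = lam * rownorm(C^T)."""
    d, t = len(C), len(C[0])
    C_tilde = row_normalize(C)
    Ct_tilde = row_normalize(transpose(C))
    Q = [1.0 if j == z else 0.0 for j in range(t)]  # unit inquiry on term z
    W = list(Q)
    for _ in range(sweeps):
        R = [sum(C_tilde[i][j] * W[j] for j in range(t)) for i in range(d)]
        W = [Q[j] + lam * sum(Ct_tilde[j][i] * R[i] for i in range(d))
             for j in range(t)]
    return W


def thesaurus(C, top=2, lam=0.5):
    """For each term z, list the `top` other terms ranked highest in W_z."""
    t = len(C[0])
    listing = {}
    for z in range(t):
        W = term_associations(C, z, lam)
        others = [j for j in range(t) if j != z]
        others.sort(key=lambda j: -W[j])
        listing[z] = others[:top]
    return listing


# Hypothetical toy corpus: 3 documents by 4 index terms.
C = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
listing = thesaurus(C)
W1 = term_associations(C, 1)
```

For term 1 the closest associate is term 0 (a direct co-occurrence), followed by term 2 (reached only through term 0). Ties, such as those among terms with zero association, are broken here by term number, which is an arbitrary choice of this sketch.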

III. THE ELECTRICAL NETWORK ANALOG

In principle at least, the general linear retrieval transformation
(17) can always be represented and solved by means of an electrical
network analog. To envisage the network which is the analog of (17),
imagine two sets of electrical binding posts, one post for each document
and one post for each index term. Suppose now that a resistor with
conductance $g_{ij}$ is soldered between every pair of binding posts i
and j, where $g_{ij}$ is the connection matrix element in (13). Also
suppose that a "leak" resistor with conductance $C_{0j}$ is soldered between
each index term post j and some common return post 0.

To pose an inquiry Q with components $q_{f}$, current is to be injected
into the index term binding posts in the quantity:

$$J_{\text{injected into } f} = q_{f}\,\Bigl(C_{0f} + \sum_{i=1}^{d} B_{fi} + \sum_{i=d+1}^{t+d} D_{fi}\Bigr)$$

Then we claim that the R values will be the voltages appearing on the
document binding posts. That this is so can be seen by writing the
equations which govern the behavior of the network.


By conservation of current at any document node p we have:

$$\sum_{j=1}^{d} J_{pj} + \sum_{j=d+1}^{t+d} J_{pj} = 0 \tag{19}$$

Now let $r_{p}$ be the voltage on a document node p, and $w_{j}$ be that
on an index term node j. We then have, writing Ohm's law out using the
notation of (13):

$$\sum_{j=1}^{d} A_{pj}\,(r_{j} - r_{p}) + \sum_{j=d+1}^{t+d} C_{pj}\,(w_{j} - r_{p}) = 0 \tag{20}$$

giving:

$$r_{p} = \frac{\sum_{j=1}^{d} A_{pj}\,r_{j} + \sum_{j=d+1}^{t+d} C_{pj}\,w_{j}}{\sum_{j=1}^{d} A_{pj} + \sum_{j=d+1}^{t+d} C_{pj}} \tag{21}$$

Likewise, at any term binding post f, we have, including the current
through the leak resistor:

$$J_{f0} + \sum_{i=1}^{d} J_{fi} + \sum_{i=d+1}^{t+d} J_{fi} + J_{\text{injected into } f} = 0 \tag{22}$$

Applying Ohm's law,

$$-\,C_{0f}\,w_{f} + \sum_{i=1}^{d} B_{fi}\,(r_{i} - w_{f}) + \sum_{i=d+1}^{t+d} D_{fi}\,(w_{i} - w_{f}) + J_{\text{injected into } f} = 0 \tag{23}$$

giving:

$$w_{f} = \frac{\sum_{i=1}^{d} B_{fi}\,r_{i} + \sum_{i=d+1}^{t+d} D_{fi}\,w_{i} + J_{\text{injected into } f}}{C_{0f} + \sum_{i=1}^{d} B_{fi} + \sum_{i=d+1}^{t+d} D_{fi}} \tag{24}$$

It remains only to note that equations (21) and (24) are formally the
same as those given in (15).

The constants $\lambda_{ff}$ and $C_{0f}$ are related in that:

$$\lambda_{ff} = \frac{\sum_{i=1}^{d} B_{fi} + \sum_{i=d+1}^{t+d} D_{fi}}{C_{0f} + \sum_{i=1}^{d} B_{fi} + \sum_{i=d+1}^{t+d} D_{fi}} \tag{25}$$

Obviously, by regulation of the "leak" conductance $C_{0f}$ any value of
$\lambda_{ff}$ between 0 and 1 can be obtained. The physical network
always has a solution as long as there is at least one path to ground
from every terminal into which current might be injected. Therefore,
as seen previously, a sufficient condition for equations (17) and (11)
to have solutions is that all $\lambda_{ff} < 1$. Finally, a linear
associative retrieval transformation (17) can in principle always be
realized by means of an electrical network.
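
The correspondence between the network and the matrix formulation can be checked numerically on a small case. The sketch below is a software relaxation of the node equations, not a description of the experimental device: it assumes a hypothetical toy corpus with no document-document or term-term resistors (A = 0, D = 0), a single value `lam` for every term, leak conductances chosen from the relation (25), and a fixed number of relaxation sweeps. Each sweep resets every post to the voltage required by current conservation, as in (21) and (24).

```python
# Hypothetical toy corpus: 3 documents by 4 index terms; C[p][f] is the
# conductance of the resistor joining document post p and term post f.
C = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
d, t = len(C), len(C[0])
lam = 0.5                  # assumed common normalization constant
Q = [0.0, 1.0, 0.0, 0.0]   # inquiry on term 1 alone

# Total conductance from each term post to the document posts.
term_sum = [sum(C[i][f] for i in range(d)) for f in range(t)]

# Leak conductance chosen so that lam = term_sum / (leak + term_sum), as in (25).
leak = [term_sum[f] * (1.0 - lam) / lam for f in range(t)]

# Injected current: inquiry value times total conductance at the term post.
J = [Q[f] * (leak[f] + term_sum[f]) for f in range(t)]

r = [0.0] * d  # document post voltages
w = [0.0] * t  # term post voltages
for _ in range(500):
    for p in range(d):
        # Equation (21) with A = 0: conductance-weighted mean of term voltages.
        r[p] = sum(C[p][j] * w[j] for j in range(t)) / sum(C[p])
    for f in range(t):
        # Equation (24) with D = 0: balance of document, leak and injected currents.
        w[f] = (sum(C[i][f] * r[i] for i in range(d)) + J[f]) / (leak[f] + term_sum[f])
```

For this corpus the document voltages settle at 5/6, 1/6 and 0, which are the values the mapping (11) yields for the same inquiry with the same normalization constant.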

IV.  THE  NATURE  OF  ASSOCIATIONS  DERIVED  FROM  TEXT 

A mathematical apparatus has been proposed which, among other things,
is capable of generating measures of association between index terms
present in a given body of text. It is interesting to speculate on the
nature of some of the linguistic factors which these association measures
may reflect.

Hopefully, one such factor will be that of semantic overlap and partial
synonymy, such as is exhibited by the pair of terms "production" and
"manufacture" or by the pair "chair" and "seat." The notion of
synonymy association has strong intuitive appeal, and associations of
this kind tend to be relatively permanent features of the language.

A second association-producing factor also exists, however; it relates
not to synonymy but rather to real-world relationships between the
objects or actions designated by the index terms. Associations due to
this factor, for example, are exhibited in the relationships among "atom,"
"warhead," "bomb," and "missile," and in the relationship between
"satellite" and "Cape Canaveral." Psychologists sometimes refer to these
associations as being due to "contiguity." They tend to be impermanent and
to be strongly conditioned by the nature of one's experience. The
association between "satellite" and "Cape Canaveral" could, for example,
disappear once again if satellites were to be launched primarily from
a new location. And while the general public today might agree that
"satellite" and "Cape Canaveral" are more highly associated than "satellite"
and "telemetry," a population of electronic engineers might have the
opposite point of view.

There is little doubt that information about both synonymy association
and contiguity association could be exploited within the context of a
document retrieval system, if this information could be made readily
available. As yet, however, there is relatively little experimental
evidence to indicate the extent to which the associations generated by
the linear association process (or any other association process, for
that matter) reflect either contiguity or synonymy. Nonetheless, the
writers feel that it is appropriate to devote a few paragraphs to
conjectures which seem to indicate that the linear process can be used to
generate associations either due to the contiguity and synonymy factors
in combination, or due to either of these factors separately.

These  conjectures  are  based  on  interpretation  of  the  coefficients  in 
the  right  hand  side  of  the  power  series  expansion  of  the  index  term 
association  matrix,  from  (12): 

(I - \lambda K)^{-1} = I + \lambda K + (\lambda K)^2 + (\lambda K)^3 + \cdots    (26)

The  leading  coefficient  I  generates  the  identity  association;  the 
fact  that  any  index  term  is  most  highly  associated  with  itself  is  of 
course  a  truism  from  both  the  viewpoints  of  synonymy  and  of  contiguity. 
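
As a numerical illustration of (26), the sketch below sums the series for a small matrix and checks that the partial sum behaves like the inverse of I - λK. The 3-by-3 matrix K, the value λ = 0.3, and all helper names are assumptions chosen only for illustration; the series converges here because λ times the largest eigenvalue magnitude of this particular K is below one.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_add(a, b):
    """Add two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def series_sum(k, lam, n_terms):
    """Partial sum I + lam*K + (lam*K)^2 + ... containing n_terms terms."""
    n = len(k)
    lam_k = [[lam * x for x in row] for row in k]
    total = identity(n)   # leading term I (the identity association)
    power = identity(n)
    for _ in range(n_terms - 1):
        power = mat_mul(power, lam_k)
        total = mat_add(total, power)
    return total


# Hypothetical symmetric association matrix for three index terms.
K = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
LAM = 0.3  # assumed value, small enough for convergence with this K

S = series_sum(K, LAM, 60)

# If S approximates (I - lam*K)^{-1}, then (I - lam*K) * S is close to I.
I_MINUS = [[(1.0 if i == j else 0.0) - LAM * K[i][j] for j in range(3)]
           for i in range(3)]
residual = mat_mul(I_MINUS, S)
```

Truncating the sum after the second or third term isolates the individual coefficients discussed in the following paragraphs.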

The second coefficient, λK, generates a measure of association
between  a  pair  of  terms  which  depends  on  the  number  of  times  they  have 
co-occurred  in  the  given  body  of  documents.  We  speculate  that  this 
coefficient  primarily  reflects  contiguity  association,  since  a  pair  of 
index  terms  found  in  a  certain  document  will  in  general  describe  things 
that  have  to  do  with  one  another  in  fact;  indeed,  what  they  have  to  do 
with  one  another  is  often  the  subject  of  the  document.  Depending  on  the 
methods and conventions used for indexing, of course, partially synonymous
index terms may also be used to characterize a given document, but this effect
can still reasonably be expected to be overshadowed by that of contiguity.

A first conjecture, then, is that the λK term primarily reflects association
due to contiguity. Considering association due to this contiguity
factor λK alone, any given index term will bear a stronger or weaker
measure  of  K-association  with  every  other  index  term.  For  any  given 
index  term,  the  set  of  K  associations  it  bears  with  all  other  index 
terms  will  be  called  its  contiguity  profile. 
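
A contiguity profile can be sketched as follows. The code assumes, purely for illustration, that K holds raw co-occurrence counts (the number of documents in which two index terms appear together); the normalization actually used for K is not restated here, and the documents, terms, and function names are hypothetical.

```python
from itertools import combinations


def cooccurrence(docs, terms):
    """Count, for every pair of index terms, the documents containing both."""
    index = {t: i for i, t in enumerate(terms)}
    n = len(terms)
    k = [[0] * n for _ in range(n)]
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            i, j = index[a], index[b]
            k[i][j] += 1
            k[j][i] += 1  # keep K symmetric
    return k


def contiguity_profile(k, terms, term):
    """Return the K-associations of one term with every other term."""
    i = terms.index(term)
    return {t: k[i][j] for j, t in enumerate(terms) if j != i}


# Hypothetical index terms and documents (each document is a set of terms).
TERMS = ["satellite", "Cape Canaveral", "telemetry", "missile"]
DOCS = [
    {"satellite", "Cape Canaveral"},
    {"satellite", "Cape Canaveral", "missile"},
    {"satellite", "telemetry"},
]

K = cooccurrence(DOCS, TERMS)
profile = contiguity_profile(K, TERMS, "satellite")
```

In this toy collection "satellite" is more strongly K-associated with "Cape Canaveral" than with "telemetry"; a collection written by electronic engineers could reverse that.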


It is apparent that the third coefficient in the series expansion, (λK)^2,
generates a measure of association between index terms which depends only
on the similarity of their contiguity profiles. That is, the (i, j) element of
(λK)^2 will be large if and only if index terms i and j have similar contiguity
profiles.  At  this  point,  it  can  be  conjectured  plausibly  that  synonymous 
index  terms  can  be  identified  and  associated  by  the  similarity  of  their 
contiguity  profiles.  Indeed,  arguments  in  favor  of  this  second  con¬ 
jecture  have  frequently  been  advanced  by  structural  linguists.  Presuming 
its validity, it follows that the (λK)^2 coefficient gives a measure
of  association  due  primarily  to  synonymy. 
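
The link between the squared coefficient and profile similarity can be made concrete. For a symmetric K, element (i, j) of K squared equals the dot product of rows i and j of K, that is, of the two contiguity profiles. The sketch below uses a hypothetical K and omits the scalar factor λ^2, which only rescales every element; the dot product is one simple reading of "similarity" and is an assumption of this example.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def second_order(k):
    """Return K squared for a symmetric K.

    Entry (i, j) is the dot product of the contiguity profiles
    (rows of K) of terms i and j.
    """
    n = len(k)
    return [[dot(k[i], k[j]) for j in range(n)] for i in range(n)]


# Hypothetical co-occurrence counts. "rocket" and "missile" never co-occur
# with each other, but they co-occur with the same other terms.
TERMS = ["rocket", "missile", "warhead", "launch", "paper"]
K = [
    [0, 0, 2, 3, 0],  # rocket
    [0, 0, 2, 3, 0],  # missile
    [2, 2, 0, 0, 0],  # warhead
    [3, 3, 0, 0, 0],  # launch
    [0, 0, 0, 0, 0],  # paper
]

K2 = second_order(K)
```

Here the first-order association between "rocket" and "missile" is zero, yet their second-order association is large because their profiles coincide, which is the behavior the second conjecture relies on.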

Moreover, if both of the above conjectures are valid, it follows that all
of the odd powers of K in the series expansion primarily represent
association due to contiguity, and all of the even powers of K represent
association primarily due to synonymy. The association matrix we have
been dealing with so far (26) therefore reflects association due to both
contiguity  and  synonymy  factors  in  combination.  There  is  nothing,  however, 
to  prevent  use  of  other  transformations  which  represent  these  factors 
separately. In particular, the association matrix

\lambda K [I - (\lambda K)^2]^{-1} = \lambda K + (\lambda K)^3 + (\lambda K)^5 + \cdots    (27)

contains only odd powers, and according to the above conjecture primarily
represents contiguity association; the association matrix

[I - (\lambda K)^2]^{-1} = I + (\lambda K)^2 + (\lambda K)^4 + (\lambda K)^6 + \cdots    (28)

contains  only  even  powers,  and  primarily  represents  synonymy  association. 
A subsequent paper will deal with these points further, and will describe
experimental evidence relating to these conjectures.
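
The separation into odd and even powers can be checked numerically. The sketch below accumulates the two partial sums for a small hypothetical K and an assumed λ, so that the even sum can be compared with the inverse of I - (λK)^2 and the odd sum with λK times the even sum. Matrix values and function names are illustrative assumptions only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def odd_even_sums(k, lam, max_power):
    """Accumulate odd and even powers of lam*K separately, up to max_power."""
    n = len(k)
    lam_k = [[lam * x for x in row] for row in k]
    odd = [[0.0] * n for _ in range(n)]
    even = identity(n)            # power 0 is the identity term
    power = identity(n)
    for p in range(1, max_power + 1):
        power = mat_mul(power, lam_k)
        target = odd if p % 2 else even
        for i in range(n):
            for j in range(n):
                target[i][j] += power[i][j]
    return odd, even


# Hypothetical symmetric K and an assumed lambda small enough to converge.
K = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
LAM = 0.3

ODD, EVEN = odd_even_sums(K, LAM, 80)
```

Adding ODD and EVEN element by element recovers the combined series, so the two matrices simply partition the full association into its conjectured contiguity and synonymy parts.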


If  associations  are  based  on  a  finite  body  of  text,  there  is  an  important 
third association-producing factor in operation which depends on the
statistics  of  the  usage  of  the  terms  in  indexing  the  collection.  Suppose, 
for  example,  that  in  a  certain  small  set  of  documents,  "paper"  and 
"tractor"  happen  to  be  used  as  index  terms  in  characterizing  exactly  the 
same  subset  of  documents.  The  inadequacy  of  the  sample  of  documents 
thus leads to the conclusion that "paper" and "tractor" are strongly
associated. This is, of course, what we desire in a retrieval context.
The two terms are wholly redundant for retrieval since the use of either
term in an inquiry will cause the retrieval of exactly the same documents.
An analogous situation arises when the overlap is only partial. But
this third factor points to a problem--even if associations derived from
text  can  be  expected  to  yield  valid  measures  of  contiguity  and  synonymy 
relationships,  this  will  be  so  only  when  the  body  of  text  is  sufficiently 
large  to  give  statistically  meaningful  results. 
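
The sample-size effect can be illustrated with a toy calculation. The overlap measure below (shared documents divided by documents using either term) is an assumption made for this example and is not the association measure of the linear model; the document sets are hypothetical.

```python
def overlap(docs, a, b):
    """Fraction of documents indexed by a or b that are indexed by both."""
    with_a = {i for i, d in enumerate(docs) if a in d}
    with_b = {i for i, d in enumerate(docs) if b in d}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


# In the small sample, "paper" and "tractor" index exactly the same documents.
SMALL = [
    {"paper", "tractor"},
    {"paper", "tractor", "farm"},
    {"farm"},
]

# A larger (still hypothetical) sample separates the two terms.
LARGER = SMALL + [
    {"paper", "ink"},
    {"paper", "printing"},
    {"tractor", "plough"},
    {"tractor", "farm"},
]

small_overlap = overlap(SMALL, "paper", "tractor")
larger_overlap = overlap(LARGER, "paper", "tractor")
```

The two terms look fully redundant in the small sample and only weakly related in the larger one, which is the statistical caution raised in the paragraph above.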


V. EXPERIMENTAL WORK

A.  The  ACORN  Devices 

Solution of equation (17) by digital techniques involves multiplication
and inversion of very large matrices--tedious processes
even when a very high speed digital computer is available. Of course,
computational short cuts exist when the matrices are very sparse, as
these matrices are, but nonetheless a very large amount of computing is
still required if the processing is to be done digitally. For this
reason we have begun to investigate the practical use of analog electrical
networks which solve the linear equations directly.

Two simple experimental devices for linear association have been built
and are undergoing testing; we have called them ACORN-I and ACORN-II,
standing for "Associative Content Retrieval Network." Although the
ACORNs are both basically small scale demonstration devices, some
interesting  preliminary  experiments  are  being  done  with  them. 

ACORN-I is shown in Fig. 1; it presently accommodates a total of 82 sentences
(which represent documents) and index terms. Each index term and
each sentence is represented by a terminal, and the terminals are
interconnected by resistors as shown. The light-colored wires terminate with
alligator clips, and are attached to the terminals for the key words in
the inquiry. The relative voltages on these wires are controlled by the
potentiometer knobs shown. As the overall voltage is raised by turning
the large knob, current injected into the network is increased and the
neon bulbs connected to the sentence terminals light up in the order of
"relevance" determined by the network. Some examples of the behavior
of ACORN-I will be discussed in the Appendix.
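
The behavior of such a network can be imitated digitally. The sketch below is only a rough model under stated assumptions: every index term and sentence is a node, each term-sentence connection is a resistor of 2.2K ohms, each node has a 22K ohm normalization resistor to ground, and a fixed current is injected at each key term. These values, the single connection strength, and all names are illustrative and are not taken from the actual circuit equations. Node voltages are found by repeated Gauss-Seidel sweeps of the nodal equations, and sentences are ranked by voltage, which stands in for the order in which the neon bulbs would light.

```python
def solve_network(sentences, key_terms, r_conn=2200.0, r_norm=22000.0,
                  sweeps=2000):
    """Return node voltages of a toy term-sentence resistor network."""
    g_c, g_0 = 1.0 / r_conn, 1.0 / r_norm   # conductances in siemens
    terms = sorted({t for ts in sentences.values() for t in ts})

    # Each sentence node is wired to the nodes of the terms it contains.
    neighbours = {("s", s): [("t", t) for t in ts]
                  for s, ts in sentences.items()}
    for t in terms:
        neighbours[("t", t)] = [("s", s) for s, ts in sentences.items()
                                if t in ts]

    # Inject one milliampere at every key term present in the network.
    current = {node: 0.0 for node in neighbours}
    for t in key_terms:
        if ("t", t) in current:
            current[("t", t)] = 1e-3

    # Gauss-Seidel iteration on the nodal (Kirchhoff current law) equations.
    v = {node: 0.0 for node in neighbours}
    for _ in range(sweeps):
        for node, nbrs in neighbours.items():
            v[node] = ((current[node] + g_c * sum(v[m] for m in nbrs))
                       / (g_0 + g_c * len(nbrs)))
    return v


def rank_sentences(sentences, key_terms):
    """Order sentence identifiers by decreasing node voltage."""
    v = solve_network(sentences, key_terms)
    return sorted(sentences, key=lambda s: v[("s", s)], reverse=True)


# Three sentences of the demonstration corpus, by their index terms.
SENTENCES = {
    5: {"missile interception"},
    6: {"air defense", "Air Force", "missile interception"},
    25: {"over-land", "missile interception", "Army", "Air Force"},
}

ORDER = rank_sentences(SENTENCES, ["missile interception"])
```

In this toy model the sentence with the fewest extra connections reaches the highest voltage, because each additional resistor draws current toward terms that are not driven by the inquiry.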

ACORN-II, shown in Fig. 2, is presently wired for the associative "retrieval"
of libraries in the area of political science; it accommodates
110 libraries and 48 topics. The topics are represented by the terminals
on the inner ring, the libraries by the terminals and neon bulbs
on  the  outer  ring.  The  complete  circuits  of  both  of  these  ACORN 
devices,  except  for  the  normalization  resistors,  are  shown  on  the  front 
panels;  the  only  electronic  components  required  are  resistors  and  neon 
bulbs.

B. Preliminary Retrieval Experiments

In  the  theoretical  discussion,  we  have  given  only  a  formal  charac¬ 
terization  of  the  ordering  induced  on  documents  in  response  to  a 
question.  In  general,  the  problem  of  measuring  the  effectiveness  of 
such  an  ordering  for  purposes  of  retrieval  is  an  extremely  difficult 
one,  and  one  which  merits  considerable  further  study.  Likewise,  a 
great  deal  still  remains  to  be  learned  about  the  behavior  of  the  linear 
networks under different conditions of normalization and inquiry configuration.
Nonetheless, it is possible to characterize the behavior
of the ACORN devices under the specific wiring configuration employed
in their construction. This behavior is illustrated in the Appendix
by means of some exercises run on ACORN-I, and is summarized in the
following paragraph. Although the linear associative technique has not
yet been tested on a large scale, there seem to be excellent reasons
for assuming that, through proper selection of normalization, the
behavior patterns of ACORN-I can just as well be realized for a much
larger number of documents and a larger number of index terms.

In  summary  of  the  Appendix,  ACORN-I  behaves  as  if  the  retrieval  order¬ 
ing  were  produced  as  a  result  of  the  interaction  of  four  main  factors 
which,  stated  in  simplified  form  in  order  of  decreasing  importance,  are: 

1.  The  number  of  key  terms  shared  by  inquiry  and  sentence;  the  sentences 
containing  the  most  key  terms  specified  in  the  inquiry  will  in 
general  be  listed  first. 

2. The frequency of use of key terms. If several key terms are
specified  in  an  inquiry,  sentences  containing  a  given  number  of 
the  more  rarely  used  of  these  key  terms  will  tend  to  be  ranked 
above  those  containing  the  same  number  of  frequently  used  key  terms. 

3. The number of extraneous index terms in a sentence. A sentence containing
a certain number of key terms specified in the inquiry and
no other key terms will tend to be ranked before one containing the
same key terms plus extraneous (nonassociated) ones.

4. The degree of indirect association between index terms in a sentence
and key terms in an inquiry. Other factors being equal, sentences
containing index terms associated with those in the inquiry are
ranked above others.

The patterns according to which these factors appear to interact are
discussed  further  in  the  Appendix;  they  are  fairly  complex,  but  on  the 
whole  quite  pleasing.  It  would  clearly  be  desirable  to  have  a  large 
scale  document  retrieval  system  behave  according  to  these  patterns, 
and  the  writers  are  therefore  planning  tests  of  the  method  on  a  much 
larger  scale. 
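
The first three factors can be mimicked by a simple sort key, sketched below. This is a crude stand-in and not the network itself: it applies the factors in strict lexicographic order and ignores the fourth factor (indirect association) entirely, so it should only be expected to agree with the device in simple cases. The corpus entries are index-term numbers of a few sentences from Table IA; term usage is counted within this excerpt only, and the function name is hypothetical.

```python
from collections import Counter

# Index-term numbers for a few sentences of the demonstration corpus.
CORPUS = {
    5: {4},
    6: {5, 1, 4},
    16: {13, 14, 1},
    17: {1, 13, 15, 14},
    20: {21, 15, 13, 14},
    24: {13, 9, 6, 24},
    25: {23, 4, 11, 1},
    26: {2, 15, 14},
}


def rank(corpus, key_terms):
    """Order sentences by shared key terms, then rarity, then extra terms."""
    keys = set(key_terms)
    usage = Counter(t for terms in corpus.values() for t in terms)

    def sort_key(sentence):
        terms = corpus[sentence]
        shared = terms & keys
        return (
            -len(shared),                    # factor 1: more shared key terms first
            sum(usage[t] for t in shared),   # factor 2: rarer key terms first
            len(terms - keys),               # factor 3: fewer extraneous terms first
            sentence,                        # deterministic tie-break
        )

    return sorted(corpus, key=sort_key)


# Inquiry on surveillance (13), tactical missions (14), continental defense (21).
EXAMPLE_1 = rank(CORPUS, {13, 14, 21})
# Inquiry on missile interception (4).
EXAMPLE_2 = rank(CORPUS, {4})
```

For these two inquiries the leading sentences agree with the orderings reported for ACORN-I in Examples (1) and (2) of the Appendix; inquiries in which indirect association matters would not be reproduced by this key.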

C.  Additional  Features  of  Interest 

Three further points of interest regarding the ACORN networks will
be mentioned in this paper. The first point is that an ACORN analog
network can be used to perform several functions which are in practice
very difficult to do with a conventional computerized document retrieval
system; for example: (1) Retrieving a set of documents most closely
related to one or more given documents. (2) Retrieving a set of index
terms ("thesaurus generation"). (3) Retrieving a set of documents and
index terms most closely related to a given combination of index terms
and documents, etc.

The retrieval of a set of index terms, for example, may be accomplished
by applying currents to the terminals for the index terms, by
reading voltages on the terminals for all other key terms, and by selecting
those terms with highest voltages on their terminals.
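
The thesaurus-generation procedure just described can be imitated on a toy network. As before, this is a sketch under assumptions: the resistor values, the one-milliampere drive current, the three hypothetical sentences, and the function names are all illustrative and do not come from the actual devices. Current is applied at the driven term, voltages are read at the other term nodes, and the terms are listed in order of decreasing voltage.

```python
def term_voltages(sentences, driven_terms, r_conn=2200.0, r_norm=22000.0,
                  sweeps=2000):
    """Voltages at the term nodes of a toy term-sentence resistor network."""
    g_c, g_0 = 1.0 / r_conn, 1.0 / r_norm
    terms = sorted({t for ts in sentences.values() for t in ts})

    neighbours = {("s", s): [("t", t) for t in ts]
                  for s, ts in sentences.items()}
    for t in terms:
        neighbours[("t", t)] = [("s", s) for s, ts in sentences.items()
                                if t in ts]

    current = {node: 0.0 for node in neighbours}
    for t in driven_terms:
        if ("t", t) in current:
            current[("t", t)] = 1e-3   # drive current at the given terms

    # Gauss-Seidel sweeps over the nodal equations.
    v = {node: 0.0 for node in neighbours}
    for _ in range(sweeps):
        for node, nbrs in neighbours.items():
            v[node] = ((current[node] + g_c * sum(v[m] for m in nbrs))
                       / (g_0 + g_c * len(nbrs)))
    return {t: v[("t", t)] for t in terms}


def thesaurus(sentences, driven_terms, top=3):
    """List the non-driven terms with the highest voltages."""
    volts = term_voltages(sentences, driven_terms)
    others = [t for t in volts if t not in driven_terms]
    return sorted(others, key=lambda t: volts[t], reverse=True)[:top]


# Hypothetical chain of sentences linking four index terms.
SENTENCES = {
    "a": {"submarines", "Navy"},
    "b": {"Navy", "Army"},
    "c": {"Army", "Air Force"},
}

RELATED = thesaurus(SENTENCES, ["submarines"])
```

Because the voltage falls off with each additional sentence separating a term from the driven terminal, the terms come out ordered by how directly they are linked to "submarines".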


The  second  point  of  interest  is  that  there  is  no  need  to  confine  an 
ACORN  to  two  levels  of  elements;  a  three-level  device,  for  example,  could 
recognize  index  terms,  documents  and  scientists.  Linear  association 
and  discrimination,  as  a  matter  of  fact,  can  be  extended  to  any  number 
of  levels.  The  third  and  final  point  relates  to  the  remarkable  insen¬ 
sitivity  of  the  association  mapping  to  variations  in  document-term  con¬ 
nection  strengths.  Pulling  out  or  cutting  a  few  randomly  selected 
wires  in  an  ACORN  generally  has  a  surprisingly  small  effect.  As  a 
matter of fact, neither of the ACORN machines has had its connections
verified or "debugged" after its initial wiring; both worked when they
were first plugged in. It was not until ACORN-II had been used for
several public demonstrations, in fact, that it was discovered, quite
by accident, that three of the wires were put in incorrectly--the
effect of the errors only showed up for highly special request combinations,
and then only in disturbing the ordering of the third, fourth,
and subsequent items retrieved! This insensitivity is of course explainable
in terms of the multiplicity of indirect and redundant association
paths which remain intact when a direct path is severed. Obviously,
this insensitivity is of value because it enables use of wide-tolerance
components in constructing ACORN devices. It also suggests that the
retrieval process can indeed be made insensitive to minor variations
in indexing--one of the practical objectives which has motivated the
work described in this paper.

APPENDIX

RETRIEVAL BEHAVIOR OF ACORN-I

ACORN-I  is  currently  wired  for  the  associative  retrieval  of  42  sentences 
using 42 index terms. For the purpose of giving examples, the first 26
sentences and the first 24 index terms are listed in Table IA; the portion
of the retrieval network employed is diagrammed in Table II. To
avoid  confusion  we  will  refer  to  an  index  term  which  appears  in  an  in¬ 
quiry  as  a  "key  term." 


In ACORN-II, all normalization conductances (C_{oj} in equation (25))
were chosen to be equal (22K), and all connection conductances (in
equation (25)) were chosen to be equal to one of two values (4.4K or
2.2K, as shown in Table II). All normalization resistors for index
terms feed into a single potentiometer which provides a controllable
"degree of association," since in effect it enables partial control
of the association parameter in equation (11). With this knob in its
minimum position, this parameter is small and association is relatively
"narrow." With this knob in its maximum position, the parameter is large
and association is relatively "free."
In the latter case, associated terms often count more than key items
in determining relevance of a sentence. To illustrate the operation
of the other factors besides association which determine the ordering,
all of the example exercises discussed were made with the association
control set to "narrow."


1.  Number  of  Key  Terms  Shared  by  the  Question  and  Sentence 

When  association  is  set  to  "narrow"  on  ACORN-I,  the  most  important 
factor  in  determining  the  ranking  of  a  sentence  in  response  to  a  given 
question  appears  to  be  the  number  of  key  terms  shared  by  the  question 
and  sentence.  Given  a  question  containing  m  key  terms,  the  sentences 
listed  first  will  be  those,  if  any,  containing  all  m  key  terms;  next 
any sentences containing m - 1 of the key terms will be listed, then
sentences containing m - 2 of the m key terms, etc., until finally,
at  the  bottom  of  the  list,  sentences  containing  none  of  the  key  terms. 

If  association  is  kept  narrow,  this  first  factor  appears  to  exercise 
the  most  influence  in  determining  the  orderings;  the  other  factors 
serve primarily to determine the ordering within a subset of sentences
having  a  given  number  of  the  key  terms. 

Example  (1) 

What Do You Know About (W. D. Y. K. A.) Surveillance, Tactical Missions,
and  Continental  Defense? 

Experimentally observed answers, in order of decreasing relevance
as given by ACORN-I:

Sentences (20), (16), (17), (26), (24), etc.

Explanation of ordering:

Sentence (20) is the only one containing all three specified key terms.

Sentences (16) and (17) each contain two of the specified key terms.
Of these, sentence (16) contains only one other index term, and
sentence (17) contains two other index terms.

All the lower-ordered sentences contain only one of the specified
key terms.

2.  The  Number  of  Extraneous  Index  Terms  in  a  Sentence 

Given  narrow  association,  a  second  factor  which  appears  to  be  of 
importance in determining the ordering in ACORN-I is the total number
of index terms in any sentence as compared to the number of key terms;
this is illustrated in the example just given. Given a subset of
sentences, each of which contains exactly the same j key terms from
the question, the ones having the fewest additional index terms will
tend to be ranked first. That is, those sentences having exactly the
j key terms and no others will be listed first, then those having the
j key terms plus one other index term will be listed next, then those
having the j key terms plus two other index terms will be listed
next, etc. The same factor tends to operate for sentences having different
subsets of j key terms; a short answer containing a given
number  of  key  terms  is  treated  by  the  network  as  preferable  to  a  longer 
answer  containing  those  key  terms  plus  other  index  terms. 

Example  (2) 

W.  D.  Y.  K.  A.  Missile  Interception? 

Experimentally  observed  answers,  in  order  of  decreasing  relevance 
as  given  by  ACORN-I: 

Sentences  (5),  (6),  (25),  etc. 

Explanation  of  ordering: 

All  three  sentences  contain  the  specified  key  term. 

Sentence  (5)  contains  no  other  index  terms. 

Sentence (6) contains two other index terms.

Sentence  (25)  contains  three  other  index  terms. 

3.  The  Frequency  of  Use  of  Key  Terms 

Again assuming narrow association, a third factor which tends to
influence the order of listing of sentences in ACORN-I is the frequency
of use of the specified key terms within the corpus as a whole. Given
a set of sentences each of which contains exactly j out of m key
terms, those containing the least frequently used key terms will tend
to be listed first. For example, given a set of sentences each of which
contains one out of several key terms in the question and no other index
terms, those sentences containing the rarely used key terms will be
listed before those containing the more frequently used key terms. This
third  factor  generally  takes  precedence  over  the  second  factor  in  determin¬ 
ing  the  ordering,  but  it  appears  to  be  considerably  less  significant 
than  the  first  factor. 


Example  (3) 

W. D. Y. K. A. Air Force and SAC?


Experimentally  observed  answers,  in  order  of  decreasing  relevance 
as  given  by  ACORN-I: 

Sentences (22), (18), (1), (9), (8), (13), (10), (7), etc.

Explanation  of  ordering: 

Since no sentence in the corpus contains both of the specified key
terms, those listed contain only one of the specified key terms.

The two sentences containing the more rarely used specified key
term, SAC, are listed first; of these (22) contains only two
other index terms and is listed first, (18) contains four other
index terms, and is listed second.

All  the  remaining  listed  sentences  contain  Air  Force,  a  frequently 
used  index  term. 

4.  Associations  Among  Terms 

The fourth and most subtle factor operating to influence the ordering
in ACORN-I is the association itself; this factor relates to how
closely index terms in a sentence are associated with key terms via
other sentences. Consider, for example, the subset of sentences which
have no index terms in common with the question. Even with the association
kept "narrow," these sentences will still be ordered with respect
to one another, their ranking being dependent on how closely the index
terms they contain are associated with terms appearing in the question.
Those sentences containing the most highly associated terms will be
listed first.

Example  (4) 

W. D. Y. K. A. Submarines?

Experimentally  observed  answers,  in  order  of  decreasing  relevance 
as  given  by  ACORN-I: 


Sentences  (23),  (24),  (11),  (13),  (10),  (12),  etc. 

Explanation  of  ordering: 

Sentences (23) and (24) are the only ones containing the key term
submarines.

(23) is listed first because it contains fewer other index terms.

Sentences (11), (13), (10), and (12) all contain the index term Navy,
which is linked to submarines via both sentences (23) and (24).

The fourth factor operates in interaction with the first, second and
third factors, its relative importance depending on the setting of the
"degree of association" knob. With association kept "narrow" the short
indirect paths will have dominant influence in determining association.
Its influence when the association is set to "narrow" appears to be
relatively weak, but still useful when the other factors do not pertain.
The interaction of all four factors can be seen in the following example:

Example  (5) 

W. D. Y. K. A. Air Force and Military Establishment?

Experimentally observed answers, in order of decreasing relevance
as given by ACORN-I:

Sentences  (1),  (10),  (11),  (13),  (12),  (2),  (6),  (25),  (23),  (9),  (16), 
(7),  (17),  etc. 

Explanation of ordering:

Sentence (1) contains both specified key terms and no other index
terms. Sentence (10) contains both specified key terms but two
other index terms as well.

All remaining sentences contain only one of the two specified key
terms.

Sentence (11) is the only one of these containing military establishment,
by far the more rarely used of the two specified key terms.

All remaining sentences listed above contain Air Force.
Of these (13) and (12) have the strongest indirect links to military
establishment via the index term Navy which appears also in (11).
Sentence (13) is listed before (12) because the former contains
a total of three index terms; the latter contains four.

The  desirability  of  the  demonstrated  interaction  of  these  four  factors 
in  determining  the  ordering  is  clear.  The  first  factor  guarantees  that 
the very top-most portion of the list of sentences will be precisely
that  set  of  sentences  which  would  be  delivered  by  a  conventional  term- 
superposition  retrieval  logic,  i.e.,  those  sentences  containing  all  of 
the specified key terms. Obviously, in the absence of additional information,
it is desirable to have listed next those sentences which have
one  less  than  the  full  complement  of  key  terms,  etc. 

The  second  factor  gives,  assuming  all  else  is  equal,  precedence  to  those 
sentences  having  fewer  possible  irrelevant  index  terms.  This  will 
result  in  the  briefer  sentences  having  the  desired  key  terms  being 
listed first; again, arguments have often been advanced in favor of such
a  retrieval  policy. 

The  third  factor  gives  heavier  weight  to  the  less  frequently  used  index 
terms  and  is  therefore  desirable  because  these  terms  convey  more  in¬ 
formation  for  retrieval  purposes.  For  example,  the  sample  corpus  contains 
the  index  term  Air  Force  very  frequently,  being  about  the  Air  Force. 
Accordingly,  when  the  term  Air  Force  is  used  in  an  inquiry,  it  is 
probably  of  considerably  less  importance  than  the  other  key  terms  in 
telling  what  the  inquiry  is  about. 

The  fourth  factor  brings  word  associations  into  the  retrieval  picture, 
and  is  of  great  interest  for  it  offers  hope  of  freeing  the  inquirer  from 
rigid  and  constrained  use  of  index  terms.  This  can  perhaps  best  be 
illustrated  by  means  of  an  example. 


Example (6)

Suppose  that  an  inquirer  knows  that  the  sentences  he  is  looking  for 
have  most  to  do  with  missile  interception: 

W. D. Y. K. A. Missile Interception?

Experimentally  observed  answers  in  order  of  decreasing  relevance  as 
given by ACORN-I:

Sentences  (5),  (6),  (25),  (2),  (12),  etc. 

Now  let  us  suppose  that  another  inquirer  has  the  same  interests  as 
the  first  inquirer,  but  does  not  know  enough  to  use  the  term 
missile  interception  (such  might  be  the  case  if  there  were 
thousands  of  recognized  index  terms).  Suppose,  instead,  that 
he  asks  a  question  with  two  key  terms,  Overland  and  Air  Defense. 

W. D. Y. K. A. Overland Air Defense?

Experimentally observed answers, in order of decreasing relevance
as given by ACORN-I:

Sentences  (2),  (6),  (25),  (12),  (5),  etc. 

Note  that  the  same  sentences  are  listed,  in  slightly  different  order. 

It must be remarked that the indirect word associations do not always
work out so neatly as in the last example, but this is to be expected
since the sample corpus is far too small to represent the language.


TABLE IA

PORTION OF A DEMONSTRATION CORPUS
FOR ACORN-I

1. The Air Force is a basic component of the military establishment.
    Index terms: 1, 3

2. National security in our epoch requires extensive facilities for
    air defense, and the role of our Air Force is to provide these.
    Index terms: 17, 5, 1

3. The Air Force maintains tight control over its nuclear warheads by
    means of a variety of intricate procedures.
    Index terms: 1, 2

4. We have agreed to furnish nuclear warheads to certain NATO member
    nations, but only under tightly specified conditions.
    Index terms: 2, 18

5. The problem of missile interception is exceedingly difficult, and
    it is possible that no effective solution may be found for years.
    Index terms: 4

6. An extremely important problem of air defense currently faced by
    the Air Force is missile interception.
    Index terms: 5, 1, 4

7. The Air Force must exercise tighter operational control over ICBM's
    and transonic bombers than over conventional aircraft.
    Index terms: 1, 12, 22, 6

8. One of the most serious problems the Air Force faces in developing
    satellite missiles is the discovery of effective but yet fail-safe
    means for command and control of them.
    Index terms: 1, 7, 8

9. The Air Force is developing integrated systems for command and control
    of ICBM's and transonic bombers, as well as other types of aircraft.
    Index terms: 1, 8, 12, 22, 6

10. The three central components of the military establishment are the
    Air Force, the Army, and the Navy.
    Index terms: 3, 1, 11, 9

11. The branches of the military establishment primarily responsible for
    intermediate range missiles are the Air Force and the Navy.
    Index terms: 3, 10, 1, 9

12. Over-water air defense is primarily a joint responsibility of the
    Navy and the Air Force.
    Index terms: 19, 5, 9, 1

13. Strenuous efforts are being made to facilitate improvement of
    communication between the Air Force, the Army, and the Navy.
    Index terms: 1, 11, 9

14. Programs are being pursued for the further development and testing
    of compact yet powerful nuclear warheads for ICBM's.
    Index terms: 2, 12

15. Intelligence reports indicate that the Soviets may be behind in
    the development of compact nuclear warheads for second generation
    ICBM's.
    Index terms: 20, 2, 12

16. Some feel that the importance of aerial surveillance and of tactical
    missions is too often neglected in the Air Force today.
    Index terms: 13, 14, 1

17. Functions of the Air Force include aerial surveillance, the
    maintenance of a strategic deterrence, and an ability to perform
    tactical missions.
    Index terms: 1, 13, 15, 14

18. The continuing development of our capability for strategic deterrence
    will eventually require that the aircraft of the SAC be replaced by
    ICBM's, and perhaps eventually by satellite missiles.
    Index terms: 15, 6, 16, 12, 7

19. Mainstays of the Air Force capability for strategic deterrence are
    our transonic bombers and, increasingly, our ICBM's.
    Index terms: 1, 15, 22, 12

20. Effective continental defense requires that our force for
    strategic deterrence be strongly supported with capabilities for
    maintaining effective surveillance and for executing tactical missions.
    Index terms: 21, 15, 13, 14

21. Aircraft are still effective for providing strategic deterrence;
    the new transonic bombers, in particular, will be useful for
    several years to come.
    Index terms: 6, 15, 22

22. The SAC has an operational command and control system used for
    aircraft guidance.
    Index terms: 16, 8, 6

23. The Navy is authorized to carry both aircraft and intermediate range
    missiles on submarines.
    Index terms: 9, 6, 10, 24

24. A primary surveillance role played by Navy aircraft is the detection
    of submarines.
    Index terms: 13, 9, 6, 24

25. Over-land missile interception is primarily the joint responsibility
    of the Army and the Air Force.
    Index terms: 23, 4, 11, 1

26. Nuclear warheads produced for applications of strategic deterrence
    may be in the 500 kiloton and larger range, while those made for
    tactical missions are generally smaller.
    Index terms: 2, 15, 14

(The above sentences are intended for demonstration purposes only;
they are not necessarily accurate in factual content and do not
represent statements of opinion.)